A Comparison of Three Document Clustering Algorithms: TreeCluster, Word Intersection GQF, and Word Intersection Hierarchical Agglomerative Clustering

Author

  • Kenrick Mock
Abstract

This work investigated three techniques for automatically clustering a collection of documents: Word-Intersection with GQF, Word-Intersection with hierarchical agglomerative clustering, and TreeClustering. The Word-Intersection algorithms have been described previously in the literature, while the TreeClustering technique is novel to this work. The idea behind the TreeCluster algorithm comes from rule induction techniques; it generates a shallow tree of clusters that a user can browse. With a fixed tree depth the algorithm runs in O(n) time, whereas the other two algorithms are O(n^2). Experimental results on a collection of Mail and Web documents indicate that the agglomerative clustering algorithm performed best but was also the slowest. The GQF algorithm performed well on the Mail domain but poorly on the Web domain unless its heuristic was tuned. The TreeCluster algorithm performed reasonably well on both domains and was the fastest of the three algorithms tested.
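The abstract does not spell out the algorithms, so the following is only a minimal sketch of what agglomerative clustering over word intersections could look like, not the paper's implementation. It assumes that each cluster is described by the set of words shared by all of its documents and that the pair of clusters with the largest shared-word set is merged first; the function names `word_set` and `word_intersection_hac` and the whitespace tokenizer are hypothetical.

```python
from itertools import combinations


def word_set(text):
    """Hypothetical tokenizer: lowercase and split on whitespace."""
    return set(text.lower().split())


def word_intersection_hac(docs, num_clusters):
    """Greedy agglomerative clustering on shared-word sets (illustrative only).

    Each cluster is a pair (member_indices, words_common_to_all_members).
    Assumption: the merge score is the size of the intersection of the two
    clusters' shared-word sets.
    """
    num_clusters = max(1, num_clusters)
    clusters = [([i], word_set(d)) for i, d in enumerate(docs)]
    while len(clusters) > num_clusters:
        best_pair, best_score = None, -1
        # Scanning every pair of clusters is what makes this approach
        # at least quadratic in the number of documents.
        for a, b in combinations(range(len(clusters)), 2):
            score = len(clusters[a][1] & clusters[b][1])
            if score > best_score:
                best_pair, best_score = (a, b), score
        a, b = best_pair
        merged = (
            clusters[a][0] + clusters[b][0],
            clusters[a][1] & clusters[b][1],
        )
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return clusters


docs = [
    "python code bug",
    "python code test",
    "mail server spam",
    "mail server inbox",
]
for members, shared in word_intersection_hac(docs, 2):
    print(sorted(members), sorted(shared))
```

The pairwise scan in the inner loop illustrates why the abstract contrasts the quadratic cost of the Word-Intersection methods with the linear cost of a fixed-depth tree.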


Similar resources

A Dynamic Programming Approach To Document Clustering Based On Term Sequence Alignment

Document clustering is an unsupervised machine learning technique that, when provided with a large document corpus, automatically subdivides it into meaningful smaller sub-collections called clusters. Currently, document clustering algorithms use sequences of words (terms) to compactly represent documents and define a similarity function based on the sequences. We believe that the word sequence is...


Fast and Intuitive Clustering of Web Documents

Conventional document retrieval systems (e.g., Alta Vista) return long lists of ranked documents in response to user queries. Recently, document clustering has been put forth as an alternative method of organizing the results of a retrieval [6]. A person browsing the clusters can discover patterns that would be overlooked in the traditional ranked-list presentation. In this context, a document c...


Fuzzy Clustering Approach Using Data Fusion Theory and its Application To Automatic Isolated Word Recognition

In this paper, the use of clustering algorithms for decision-level data fusion is proposed. The results of automatic isolated word recognition, which are derived from speech spectrograph and Linear Predictive Coding (LPC) analysis, are combined using fuzzy clustering algorithms, especially fuzzy k-means and fuzzy vector quantization. Experimental results show that the...



Bayesian Hierarchical Clustering with Exponential Family: Small-Variance Asymptotics and Reducibility

Bayesian hierarchical clustering (BHC) is an agglomerative clustering method, where a probabilistic model is defined and its marginal likelihoods are evaluated to decide which clusters to merge. While BHC provides a few advantages over traditional distance-based agglomerative clustering algorithms, successive evaluation of marginal likelihoods and careful hyperparameter tuning are cumbersome an...


Publication date: 2004